Definition: Take examples of inputs and outputs. Now, given a new input, predict its output.
| Term | Definition |
|---|---|
| Instances | Input. Vectors of attributes that defines your input space. |
| Concept | Function. Takes the input space and maps them to an output. Concept is a mapping between objects in a world and membership in a set, which makes it a function. |
| Target Concept | The correct concept to map instances to oututs. |
| Hypothesis Class | Set of concepts that you're willing to consider in order to find the target concept (e.g. Binary or multiclass, but not considering regression output). |
| Sample | Training set. Set of all of our inputs paired with labels with the correct output. Foundation for inductive learning. |
| Candidate Concept | Concept that you think is the target concept. (e.g. Anyone with curly hair is credit-worthy) |
| Testing Set | Set of instances to test a candidate concept. |
Work through each instance and give an answer based on the decision tree.
Note:
"20 questions" example: We want to narrow possibilities with each questions asked. This motivates us to ask very general Qs that significantly narrows down possiblities.
Notes:
Example: Boolean expressions as decision tress
| Type | Size of Nodes | Description |
|---|---|---|
| n-OR ("any" fn) | linear $n$ | Easy to solve |
| n-XOR ("odd/even parity" fn)* | $O(2^{n})$ | Hard to solve. No clever way to pick the right attribute to solve the problem (need to look at all attributes). |
Takeaway: We hope that our ML problems are more like n-OR problems rather than n-XOR problems because n-XOR problems are hard and expensive.
* If number of attributes that are true is odd, then the output is true.
Q: Exactly how expressive is a decision tree?
Given:
A: The hypothesis space of decision tree is very expressive. We need to cleverly search this space to efficiently find the best representation
Loop:
1. Pick best attribute A
2. Assign A as decision attribute for Node
3. For each value A, create a descendant of Node
4. Sort training examples to leaves
5. Stop if examples perfectly classified
6. Else iterate over leaves
Definition: Reduction in randomness over the labels you have over with the set of data based upon knowing the value of a particular attribute.
Entropy: "Average amount of information conveyed by an event, when considering all possible outcomes"
Formula:
Formula of entropy: $$ -\Sigma_v p(v) log p(v) $$
Types of bias:
ID3's inductive bias - Types of preferences:
Q: How to build DTs with continuous attributes?
Q: T/F, does it make sense to repeat an attribute along a path in a tree?
Overfitting: Don't trust the data too much because there is potential for noise. This happens when trees are too big.
Solutions
Regression implies continuous outputs. Information gain doesn't work well with continuous values.
Q: What's the splitting criteria for regression trees?
Q: What's the output?
Etymology: "Regression to the mean"
Finding the best line (measured by least squared error) is achieved through calculus (ie. finding the derivative).
Training data has error that doesn't model $f$, results in $f+\epsilon$
Types of error sources:
Definition: Take n-folds of training data then use one of the folds as the "test" data. Do this with each fold, using the rest as training data. Choose the model with the lowest error.
Aside - Assumptions of SL:
- Train/test sets are IID (independent and identically distributed)
Applied to housing example in video:
Example of a perceptron (binary classifier with 3 input + weights)
Quiz: Example inputs and estimated y
Image: Representation of activation space given 2 data + weights. Generates a halfplane.
Takeaway: Perceptrons are linear functions that computes a hyperplane.
Quiz 1: If we focus on $x_1 \in {0,1}$ and $x_2 \in {0,1}$, what is $y$?
Quiz 2: If $y$ is an OR function, what would we set $w_1$, $w_2$, $\theta$?
Quiz 3: Given a NOT logic (one variable), what should we set $w_1$ and $\theta$?
Takeaway: AND/OR/NOT logic can be expressed as perceptron units, even XORs.
# Perceptron Quiz answer in code
def xor_network(x1, x2, w1, w2, w_and, theta) -> bool:
activation = x1*w1 + x2*w2 + bool(x1 and x2)*w_and
return 1 if activation >= theta else 0
def test_xor(w1, w2, w_and, theta):
assert xor_network(0,0, w1, w2, w_and, theta) == 0
assert xor_network(0,1, w1, w2, w_and, theta) == 1
assert xor_network(1,0, w1, w2, w_and, theta) == 1
assert xor_network(1,1, w1, w2, w_and, theta) == 0
print("Passed all tests")
w_1 = 1
w_2 = 1
w_and = -2
threshold = 1
test_xor(w_1, w_2, w_and, threshold)
Answer to XOR question:
Definition: Given examples, find weights that map inputs to outputs.
2 ways in which perceptron training is done:
Goal: How to set the weights of a single unit so that it matches a training set.
Components:
(1) Weight change formula: Changing weight of $w_i$ by $\Delta w_i$. This is what we iterate on while there is error.
$$ w_i = w_i + \Delta w_i $$(2) Defining weight change amount - First, calculate $\hat{y}$.
(3) Determining the weight change needed based on distance between $y$ and $\hat{y}$. This is multiplied by the unit ($x$) value, counterbalanced by the learning rate $\eta$.
$$ \Delta w_i = \eta (y-\hat{y})x_i $$Definition: All examples can be separated by a linear hyperplane. This gets more complicated to visualize with 4+ dimensions.
Perceptron Rule: If we have data that is linearly separable, then the perceptron rule will find this hyperplane.
Q: What if the data is not linearly separable?
Motivation: More robust to linearly non-separable datasets.
Components:
(1) Calculate the error metric. Goal is to minimize this error.
(2) Find the partial derivative of above.
(3) We get the result - The derivative of the error $E$ with respect to $w_i$. The result looks pretty similar to the perceptron rule.
$$ \frac{\partial E}{\partial w_i} = \Sigma_{(x,y) \in D} (y-a)(-x_i) $$| Type | Learning Rule | Pros | Cons |
|---|---|---|---|
| Perceptron Rule | $$\Delta w_i = \eta (y-\hat{y})x_i$$ | Guaranteed finite convergence | Requires linear separability |
| Gradient Descent | $$\Delta w_i = \eta (y-\alpha)x_i$$ | More robust to non-linearly seprable data | Likely to converge only on local optimum |
(1) Sigmoid activation function $$ \sigma(x) = \frac{1}{1+e^{-x}} $$
(2) Derivative of the sigmoid function $$ \sigma`(x) = \sigma(x)(1-\sigma(x)) $$
Extra: How to take the derivative of the sigmoid function. Reference: TDS
Key Points:
Alternatives/supplements to vanilla gradient descent:
Definition: Representational power of the data structure you're using. Tells you the set of hypotheses you're willing to consider.
Types of neural nets and their representational power:
Definition: The algorithm's selection (ie preference) of one representation over another.
Preference bias of neural nets: Small Random values that provide simpler explanations.
- SMALL: Favors small value, because networks with small numbers have low(er) complexity. Larger values tend to overfit.
- RANDOM: Variability. Possibility of exploring outside of local minimia.
- Similar to __Occam's Razor__: "Entities should not be multiplied unnecessarily"
- Algorithm prefers stopping once it fits the data well, before letting weights to grow more.
| Pros | Cons |
|---|---|
| Remembers (no catastrophic forgetting) | Cannot generalize |
| Fast (no learning) | Slower at runtime |
| Simple (Occams Razor) | Easily overfits (Remembers noise as well) |
| Struggles with complexity (e.g. 2 same x with different y, multiple near distance neighbors with different y values) |
Given:
(1) Find $K$ nearest neighbors of $q$ $$ NearestNeighbor = \{i:d(q,x_i)k \text{ smallest}\} $$
(2.1) Classification: Return the (weighted) vote of $y_i \in \text{NearestNeighbor}$ (ie. Mode)
(2.2) Regression: Return the (weighted) mean of $y_i \in \text{NearestNeighbor}$
Takeaways:
How to calculate Manhattan/Euclidean distance: Both are derived from Mikowski distance ("p-norm distance"), with different degrees of $p$ value:
Using the first row as an example (x=(1,6), y=7):
Finding 1/3-NN:
(2,4), 8 and (7,1), 50. $y=\mu{(8, 50)}=29$(2,4), 8. $y=\sqrt{8}$ Error:
Takeaways:
Definition: As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
Example: Number of neighbors needed to cover the same dimension space.
Example: Locally weighted linear regression
Idea: Build up a bunch of rules and combine them together until you got something that's good enough.
Pseudocode:
1. Learn over a subset of data (uniform & random) and apply a learner
...(repeat)
2. Combine all the rules (How? Regression: Mean, Classification: Mode)
Example: Spam Email
Takeaway:
Pseudocode (high-level):
1. Learn over a subset of data apply a learner
2. Filter down to the "hardest" examples, ie. data points that results in highest errors.
...(repeat)
3. Combine all the rules (How? Regression: WEIGHTED mean, Classification: WEIGHTED Mode)
Classification error = # of mismatches
Given:
Definition: A learning algorithm that, under any distribution, will find a hypothesis that performs better than chance on the current distribution.
"Does better than chance" expressed more formally: For all distributions, your learner will have an expected error (hypothesis disagreeing with the real target of a single sample)
$$ \forall_D Pr_D [\cdot] \leq 1/2 - \epsilon $$ANSWER:
1/4.Some explanation in Slack thread:
One can ask a simple question: given a set of objects, X, what is the probability of seeing any particular object, x, that is drawn from X according to some distribution D? By definition, D, induces a value for each x that is between 0 and 1 inclusive where the sum of all these values is 1.
In our case, each x has some corresponding label, let’s call each on y.
A weak learner is simply a learning function that given X, D, Y, and a set of hypotheses, H, will find some hypothesis, h, drawn from H, such that the probability of drawing an x such that h(x) == y is better than chance.
We generate the hypothesis by getting the sum of all of our alpha times its hypothesis, then run it through a sign function (returns the sign, ie. -1, 1, 0 of the number).
Formula for final hypothesis $H$:
$$ \begin{align} H_{final}(x) = sgn(\Sigma_t \alpha_t h_t (x)) \end{align} $$Supporting material/notes for this section: My notes from ISYE6501
Given:
Q: Solve for the line that is described by their difference (...?)
Answer:
Idea: We can turn the $\max \{ \frac{2}{||w||}\}$ solution into a quadratic programming function.
Takeaway: All 3 optimization models are solving the same problem, but quadratic is easier to solve:
Goal: After finding an alpha that maximizes the quadratic optimization program, you can solve for the params of the hyperplane ($w$). Once we find the param, we can solve for $b$ by plugging in $w$ and $x_i$. This gives us $wx+b$.
Some considerations:
Movitation: We can separate non-linearly seprable data points by using the kernel trick.
Takeaway:
Goals - Find a formal way of addressing:
Definitions:
Tools: Use same tools for analyzing algorithms in CS.
Resources that matter in computational learning theory:
How training examples are selected, from a teacher/learner paradigm:
In the context of the game "20 questions":
Q1: How many questions does the student need to ask, if Teachers chooses the questions?
Q2: How many question does the learner must choose before it narrows down the possibility to 1 person?
Given:
Q1: Given the table of examples and hypotheses, figure out the combination of positive/absent/negated to get the hypothesis.
Q2: Harder problem - we don't have access to hypothesis beforehand.
Takeaway: Constrained set of queries make it hard to consistently reduce the hypothesis space in half, so realistically it is going to take more time than log N time (possibly exponential).
Reference: Mitchell, Ch. 7 "Computaional Learning Theory" pg. 220.
Key Idea: "How many mistakes will the learner make in its predictions before it learns the target concept?"
Key Idea: Set hypothesis as specific as possible, then generalize as you see more positive examples.
Takeaway:
Takeaway: True Error = penalized proportional to the probability of getting the hypothesis incorrect.
Definition: Learner is PAC learnable if it can learn to get a low error and return hypothesis with relatively high confidence (ie. delta) - in time that is polynomial w.r.t error, certainty and number of hypotheses.
Formula: A problem is PAC learnable if the version space of sample S is epsilon exhausted.
Key formula to find the bound: $$ m \geq \frac{1}{\epsilon} (ln|H| + ln\frac{1}{\delta}) $$
Takeaway: Haussler theorem gives a you formula (boxed in red) that tells you how many training examples you need in order to meet the epsilon threshold.
Takeaway: Answer (40) is much better than the input space (2**10, ie 1024).
Reference:
Motivation: In the previous lesson, we covered Haussler's theorem ($m \geq \frac{1}{\epsilon} (ln|h| + ln\frac{1}{\delta})$), but its success was limited to finite hypothesis space (H). How do we handle very large, or infinite hypothesis space?
Goal: Connect insight about VC dimensions to understanding sample complexity.
Many learners actually have infinite hypothesis space, rather than finite:
The lectures mentions the difference between syntactic vs semantic hypothesis space:
Significance: We can reduce the dimensionality of syntactic hypothesis space (infinite) into a finite hypothesis space. This is the underlying concept of determining VC dimensions.
From Mitchell (pg.215)
VC dimension $VC(H)$, of hypothesis space $H$ defined over instance space $X$ is the size of the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite sets of $X$ can be shattered by $H$, then $VC(H) = \infty$
Definition from lecture: What is the largest set of inputs that the hypothesis class can "shatter", ie. label in all possible ways?
From Mitchell (pg.214)
A set of instances S is shattered by hypothesis space $H$ IFF for every dichotomy of S there exists some hypothesis in $H$ consistent with this dichotomy.
- If a learner can correctly separate a set of points, it "shatters" those points.
Question: Figure out the largest set we can shatter using the given hypotheses, given:
Answer: $VC=2$
[+][+ +][+] -+ [-]- - []+ - + because the interval has to be contiguous.Question: Figure out the largest set we can shatter using the given hypotheses, given:
Answer: $VC \geq 3$
+ - + - matrix since the lines split a plane that has to be contiguousTakeaway: VC dimensions often align to the number of parameters in the hypothesis space.
Takeaway: Convex polygon has infinite VC dimensions, because it can have arbitrarily many input points.
Given:
We can determine the right sample size to meet the error rate $\epsilon$ and probability $1-\delta$ for an infinite hypothesis class.
$$ m \geq \frac{1}{\epsilon}(8 VC(H) log_2 \frac{13}{\epsilon} + 4 log_2 \frac{2}{\delta}) $$This can be simplified to reflect a finite hypothesis class.
$$ m \geq \frac{1}{\epsilon}(\ln{|H|} + \ln{\frac{1}{\delta}}) $$Takeaway: H is PAC-learnable IFF VC dimension is finite. VC dimensions captures the idea of PAC learnability in a single concept.
From dev.to%3A-,The%20VC%20of%20Finite%20Hypothesis%20Space,-If%20we%20denote)
If we denote the VC of Finite Hypothesis Space by d, there has to be 2^d distinct concepts (as each different labelling can be captured by a different hypothesis in a class) - therefore 2^d is less than or equal to the number of hyptheses |H|.
Rearranging, d <= log2 (|H|). So a finite hypothesis class gives us finite VC dimensions, and therefore make things "PAC Learnable". There is a further proof that this goes the other way - H is PAC Learnable if and only iff the VC Dimension is finite. VC Dimension captures the notion of PAC Learnability in a single concept.
From CMU lecture notes:
Through our discussions of VC theory we have seen that we can improve generalization by controlling the complexity of the concept class H from which we are choosing a hypothesis. We saw that the shatter coefficient and VC dimension were useful measures of complexity because we could bound the performance of a learning algorithm in terms of these quantities and the amount of data we have.
Reference:
Motivation:
Key Idea: Find the hypothesis $h$ from hypothesis class $H$ that is the most probable given the data $D$. $$ argmax_{h \in H} P(h | D) $$
Mitchell, p156
Bayes theorem ... provides a way to calculate the posterior probability P(h|D), from the prior probability P(h), together with P(D) and P(D|h)
Given:
Bayes theorem $$ P(h|D) = \frac{P(D|h)P(h)}{P(D)} $$
Takeaway: Bayes rule allows us to switch $P(h|D)$ to $P(D|h)$.
A man goes to see a doctor. She gives him a lab test. The test return correct positive 98% of the time, and a correct negative 97% of the time. The test looks for a rare disease that only 0.8% of the population has it. The test comes out positive, does the man have the rare disease?
Given:
We can apply the Bayes Theorem to get the probability of having the disease given that the test returns positive: $$ P(disease|testPositive) = \frac{P(testPositive|disease) P(disease)}{P(testPositive)} \\ = \frac{P(testPositive|disease) P(disease)}{P(testPositive) P(disease) + P(1 - testNegative) P(1 - disease)} \\ = \frac{.98 * 0.008}{(.98 * 0.008) + (0.03 *(1 - 0.008))} \\ = 0.2085106383 \\ = 21\% $$
ANSWER: Probably not. Even if the man tests positive there is only 21% chance that he has the disease.
Takeaway:
Both are derived from the initial calculation: For each hypothesis in $H$, calculate $P(h|D)$
$$ P(h|D) = \frac{P(D|h)P(h)}{P(D)} $$